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Abstract 

Community detection has been one of the central problems in network studies and directed 
network is particular challenging due to asymmetry among its links. In this paper, we 
found that incorporating the direction of links reveals new perspective on communities 
regarding to two different roles, source and terminal, that a node plays in each community. 
Intriguingly, such communities appear to be connected with unique spectral property of the 
graph Laplacian of the adjacency matrix and we exploit this connection by using regularized 
SVD methods. We propose harvesting algorithms, coupled with regularized SVDs, that are 
linearly scalable for efficient identification of communities in huge directed networks. The 
algorithm showed great performance and scalability on benchmark networks in simulations 
and successfully recovered communities in real networks applications (with ~ 2 million 
nodes and ~ 50 million edges). 

Keywords: Community extraction, Graph Laplacian, Regularized SVD, Scalable algo- 
rithm, Social networks 



1. Introduction 



Many real world problems can be effectively modeled as complex relationship in networks 
where nodes represent entities of interest and links mimic the interactions or relationships 
among those nodes. The study of such complex relationship networks, recently referred to 



as network science, can provide insight into their structures and properties (Newman, 2006). 



One particularly interesting area in studies of network structures is searching for important 
sub- networks which are called communities, modules or groups. A community in a network 
is typically characterized by a group of nodes that have more links connected within the 
community than connected to out of the community (Fortunato, 2010). Community de- 



tection is in growing attention not only because it leads to understanding of the complex 
network structure, but also it allows further analysis such as studies on information flows 
on in networks, evolution of networks and visualization of networks. 

Upon the advancement of technology, the size of the network we encounter today is way 
beyond what we traditionally are able to handle. Modern networks such as social networks, 
citation networks, Internet networks simply exceed millions of nodes. Consequently, recent 
developments of community detection algorithms pay considerable attention to the scala- 
bility of the algorithms. A survey paper, iParthasarathy et al. (2011), summarizes recent 



developments in scalable algorithms such as Meits (iKarypis and Kumarl |1998[), Graclus 



dDhillon et aL}|2007|), ML R-MCL ( |Satuluri and Parthasarathyj |2009| ) and local graph clus- 
tering of Andersen et al. (2006). However, those scalable community detection algorithms 
are mainly designed for undirected networks. 

In many practical applications, there is a large number of networks that are directed 
in nature, such as the World Wide Web, tweeter's follower- followee network, and paper 
citation networks. Even though a few algorithms developed for undirected networks can 



be extended to apply on directed networks (Danon et al. , 2005; Fortunato, 2010 Coscia 



et al., 2011), the notion of community in undirected networks cannot be simply translated 
to the directed ones. A common approach to handle directed networks is symmetrizing 
the adjacency matrix and treating the resulted matrix as from an undirected network. 
However, it is not uncommon to see that ignoring the direction of edges results in abnormal 
communities (Leicht and Newman, 2008). 

The aim of the present work is to develop scalable community detection algorithms that 
explicitly incorporate the direction of links. The nodes and links in a directed network are 
often presented by a graph Q — (V, £), with V = {^i, . . . , v n } and £ = {e\ . . . , e m } denoting 
the vertex set and edge set respectively. For an existing edge e in the network, we denote 
its source node and terminal node as v s (e) and v t (e) respectively. Let W be the |V| x |V| 
adjacency matrix in which W(i,j) = 1 indicates the existence of an edge originated from v\ 
and pointed to vj and W(i,j) = otherwise. 



Compared with in-depth studies of community structures in undirected networks (Danon 



et al. , 2005; Fortunato, 2010; Coscia et al. , 2011 ), community detection in directed networks 



has not been as fruitful. One particular difficulty in studying the structure of directed 
networks is the lack of a clear definition of the connectivity between each pair of nodes. As 
a result, it is hard to define communities due to the asymmetry nature of the edges. Existing 
works (Kleinberg, 1999) pointed out the importance of recognizing the dual roles, as source 
and terminal of edges, that nodes play in a directed network. In this paper, we concentrate 
on directly incorporating directional edges in analysis and we start with defining a new 
type of community, Directional Component, which contains a source part and a terminal 
part based on the connectivity between nodes following a path of the edges in alternative 
directions. 

From a theoretical point of view, we study the connection between the directional com- 
ponents in a directed network and the spectrum (singular values and singular vectors), of the 
graph Laplacian of the network. We prove that the number of directional components in a 
directed network equals to the multiplicity of the top singular values of the graph Laplacian. 
In addition, nodes corresponding to nonzero elements in each pair of top singular vectors 
form the source part and terminal part of one of the directional components. With these 



theoretical insights, we observe that the successful DI-SIM algorithm proposed in Rohe 



and Yu (2012) is able to find the source parts and the terminal parts of the directional 



components. Even with small number of external edges connecting the directional compo- 
nents, the DI-SIM algorithm still stands a good chance of finding approximate directional 
components due to the stability of the top spectrum of the graph Laplacian. 

To build scalable community detection algorithms, we explore the idea of extracting 
one community, like an approximate directional component, at a time. We propose to use 
regularized SVD on the graph Laplacian to achieve this goal. Two types of penalizations, 
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Lq and Elastic-net, are studied in this paper. The Lq regularized SVD leads to hard- 
thresholding of the vectors in the power methods iteration and Elastic-net regularization 
leads to soft-thresholding. We show that the corresponding threshold level in both methods 
can be calculated through direct deterministic searching algorithms, other than calling 
optimization routines. This direct searching step dramatically reduces the computation time 
and improves the scalability of the regularized SVD methods. Using the regularized SVD 
as building blocks, we design a community harvesting algorithm that extracts communities 
from a large network one at a time. 

We test the proposed community harvesting algorithms and compare them with existing 
methods on simulated networks and real networks. In our simulation study, we find that Lo 
harvesting and Elastic-net harvesting algorithms perform well when the network is sparse or 
the communities are tight, which is the case for most social networks. Through simulations, 
we also find the harvesting algorithm scale well to big networks, even with a large number 
of communities. We apply these algorithms to a citation network (30,228 nodes and 110,654 
edges) with manually assigned categories on papers and found the results from harvesting 
algorithms are largely consistent with the categories and outperform other methods. In 
addition, we check the scalability of our algorithms on a social network (1,944,589 nodes 
and 50,655,143 edges) and found the algorithms are applicable for such a large network and 
the results reveal interesting structures as well. 

The remainder of the paper is organized as follows. In Section [2j we review recent 
developments in community detection in directed networks and propose a new concept of 
community, Directional Component. Section [3] exploits the connections between directional 
components and the spectrum of the graph Laplacian of the adjacency matrix. Based on this 
connection, we propose regularized SVD algorithms for community extraction. Section [4] 
addresses the problem of selecting parameters for regularized SVD and develops community 
harvesting algorithm. Sect ion [5] contains simulation studies in various community structures 
and Section [6] shows the results of harvesting algorithms in two real-world networks. Finally, 
we conclude the paper with conclusions and discussion in Section [7| Theory proofs are 
detailed in Appendix. 



2. Directed Networks and Directional Components 

In this section, we first review existing efforts made in incorporating directional information 
for community detections in directed networks. Following the review, we extend those 
concepts to a new definition of community, Directional Component. We further discuss the 
properties of directional components and present a simple algorithm to identify them. 



2.1 Review: Community Detection in the Directed Networks 

A common approach to handle a directed network is symmetrizing the asymmetric adja- 
cency matrix and treating the resulted matrix as undirected. This approach may work 
reasonably well in cases where a directed network has large portion of symmetric edges. 
However, ignoring the direction of edges usually results in abnormal community detection 
results ( |Leicht and Newman , 2008). 

In the literature of community detection in directed networks, several authors has at- 
tempted to directly incorporate directional information into their algorithms. They can be 
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categorized as probability model based (Newman and Leicht 



et al., 2005; Andersen and Lang, 2006 



et al. 



et al. 



2007 



Rohe and Yu 



Arenas et al., 2007; Leicht and Newman 



2007[) , spectral based (|Capocci 



2012), modularity based (Guimera 



2008), and information based (Rosvall 



2009). In particular, several methods, Capocci et aLl (2005); Leicht and Newman 



(2008); Satuluri and Parthasarathy (2011); Rohe and Yul (2012), are based on concepts of 



in-link similarity and out-link similarity. The in-link (out-link) similarity between a pair 
of nodes is defined as the number of common nodes having edges pointing to (from) them. 
A modularity of a directed network was proposed based on these two similarity measures 
in Leicht and Newman (2008) and they suggested to find communities by maximizing this 



modularity. Satuluri and Parthasarathy (2011) proposed a degree-discounted symmetriza- 



tion method that combines the in-link similarity and the out-link similarity of a directed 
network so that existing community detection algorithms for undirected networks can be 
applied. 



It is also suggested in Rohe and Yu (2012) to detect communities by investigating the 



spectrum of graph Laplacian of the adjacency matrix W . The graph Laplacian Q of a 
directed network is defined as 



Q = D r *WD c \ 



(1) 



where D r is the diagonal matrix of out-degrees (or row sum of W) and D c is the diagonal 
matrix of in-degrees (or column sum of The proposed algorithm, DI-SIM algorithm, 

clusters nodes in two different ways (co-clustering) by running the k-means algorithm on the 
leading left singular vectors and right singular vectors separately. They showed that the 
DI-SIM algorithm may recover stochastically equivalent sender- nodes and receiver-nodes 
under Stochastic Co-Block model, which is a relaxed version of Stochastic Block model of 



Holland et al. (1983) 



A close look at these methods reveals that a key to the success of these methods is to 
recognize the dual roles that nodes play in a directed network, as source nodes and terminal 



nodes. Actually, Kleinberg (1999) differentiated these two different roles that webpages 



play in the World Wide Web network and proposed a hyperlink-induced topic search (HITS) 
algorithm to find important websites. The influence of all websites is characterized by a 
hub score vector (u) and an authority score vector (v) that are solutions of 



u — aWv, 
v = /3W^u, 



subject to u^u = v*v = 1. 



(2) 



for fixed constants a, (3. An interesting property of the two scores is that this pair of scores 



actually define each other. In his words (Kleinberg, 1999), there are reinforcement relation 



ship of nodes, which means that good hubs are the nodes that refer to good authorities and 
good authorities are the nodes that are referred by good hubs, as reflected in Q. Even 
though the HITS algorithm originally aims the global reinforcement relationship, it gives 
us a clue to a concept of community that is based on the local reinforcement relationship 
between hub nodes and authority nodes. 



1. We define [j = for convenient notations. 
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2.2 Directional Components 

As we have discussed so far, understanding the two different roles that a node plays in 
a directed network is crucial in incorporating the direction of edges into the concept of 
communities. We start with exploring different definition of connected nodes in this section. 
Two types of connectivity between nodes have been studied in the literature of directed 
networks. Weak connectivity defines two nodes s and t (s,t G V) as weakly connected if 
they reach each other through a path ei, e2, . . . , e/ 3 regardless of the directions of edges in 
the path. Meanwhile, Strong connectivity takes in account of the directions of the edges in 
the path and claims nodes s and t are strongly connected only if the path (ei, e2, . . . , e{) 
also satisfies v s (e\) — s,v t {e{) — t,v t (ej £ ) = v s (ek+i),k = 1,...,/ — 1. In this paper, we 
define a new type of connectivity, D- connectivity, as 

Definition 1 Two nodes s and t (s, t G V) are D-connected, denoted by s —± t, if there 
exists a path of edges (ei, . . . , e2 m -i), m ^ satisfying v s (e\) — s, v t (e2m-i) — t and 

v t (e2k-i) = (common terminal nodes) 

v s ( e 2k) — vS ( e 2k+i) (common source nodes) 

for k = 1, 2, . . . , max{m — 1, 1}. 

D-connectivity follows the edges in alternative directions, one forward and then backward. 
We call this sequence of edges a D-connected path. Figure [T] provides an illustration of 
D-connectivity. For example, we observed that A — >> D through the sequence of edges 
(e2, e3, e£) and E — >> A through the sequence of edges (es, e±,e{). 
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Figure 1: An example directed network and its adjacency matrix. 
The definition of D-connectivity is a restricted version of a concept called alternating 



connectivity that was introduced by Kleinberg (1999), in the context of analyzing the cen- 



trality of webpages in a sub-network of World Wide Web using the HITS algorithm. The 
difference is that the alternating connectivity allows two nodes to be any pair of nodes on 
a D-connected path being alternatively connected. |Kleinberg| (|1999[) also pointed out the 



difficulty of extending the alternating connectivity between a pair of nodes to a concept 
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that characterizes a group of tightly connected nodes (a community), because transitive 
relation does not hold in alternating connectivity. However, the D-connectivity bypasses 
this problem by recognizing the two different roles, sources and terminals. Next we define 
a new type of community structure, Directional Component, based on the D-connectivity. 

Definition 2 A Directional Component (DC) consists a source node set S and a ter- 
minal node set T (5, T C V) and they are the maximal subset of nodes such that any pair 
of nodes (s, t), s G S, t G T, are D-connected (s — » t). We call S and T the source part and 
terminal part of the directional component and denote DC = (S,T). 

The concept of directional component provides a potential way to partition a network into 
smaller communities. First, a node v in a directional component may belong to either the 
source part S or the terminal part T or both. Second, in a directed network that contains 
multiple directional components DC\, DC2, • • • , DCk, any node v G V can only belong to 
one of the source part S. In other words, the source parts Si, . . . , Sk are disjoint and the 
same holds for Ti, . . . , Tk- However, it is possible that v G Si and v G Tj for any i and j. 




Figure 2: The decomposition of the network in Figure |l| and rearranged adjacency matrix. 

This two-way partition of nodes respects the asymmetric property of the directed net- 
work. Figure [2] illustrates the decomposition of the directed network shown in Figure [TJ 
Three directional components are found and the source and terminal parts of each di- 
rectional component are displayed in boxes. As we discussed above, node A appear in the 
source part and the terminal part of the same component, but node D belongs to the source 
part and the terminal part of two different directional components. Meanwhile, all source 
(terminal) parts are disjoint. After reorganizing the nodes by directional components, it is 
clear that there is no edges between the source part and terminal part of different directional 
components, as shown in the right panel of Figure [2] Therefore, this two-way partition of 
nodes results in a re-ordered adjacency matrix that exhibits clear block- wise structure. 

Finding directional components can be achieved through a simple searching algorithm 
of computation complexity 0(\V\ + \£ |). Identification of one directed component is done by 
iterations of adding nodes into the source part and the terminal part. Staring with picking 
one node with a positive out-deg ree as a member of a source part, the algorithm iterates 
between steps: 
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1. For each node in source part, add all nodes pointed by this node to the terminal part; 

2. For each node in terminal part, add all nodes pointing this node to the source part; 

The iteration runs until no more nodes need be added in either source part or terminal 
part. Once the iteration terminates, one may remove all edges of the identified directional 
component from the original network and repeat the iteration to identify another directional 
component on the reduced network. 

One drawback of this direct searching algorithm is that it usually detect only few large 
directional components, even when the network has much finer community structures. This 
phenomenon is due to fact that it is unrealistic to expect absolutely no links between 
those small components. For example, a direct search on the Cora citation network [^J 
which presents bibliographic citation between papers in computer science, leads to one 
giant directional component that covers all nodes. However, a close look at the dataset 
suggests that there are much more citations between papers in a similar field and much less 
between papers in different ones. In other words, many smaller sized communities exist, 
but the algorithm is too strict to detect them. Therefore we need consider more flexible 
algorithms that may detect a directional component with existence of a small number of 
external edges. 



3. Regularized SVD Algorithms for Community Extraction 

We first explore the connection between the directional components in a directed network 
and the leading spectrum of its graph Laplacian Q, as defined in 0. These spectral 
properties motivate us to propose using regularized SVD algorithms to extract communities 
that are close to directional components. We also show that the computation of this class 
of regularized SVD algorithms is in linear order of the number of nodes and edges in the 
network, so it is scalable. 



3.1 Spectrum of Graph Laplacian and Directional Components 

We start with investigating the spectral properties of the graph Laplacian of a network with 
several DCs. It is well known that the spectrum of graph Laplacian of an undirected net- 



work is strongly tied with its connected components (Von Luxburg, 2007). More specifically, 
the multiplicity of the largest eigenvalue (one) is the same as the number of connected com- 
ponents in the network. The next theorem shows similar connections in directed networks, 
in which the concept of connected components is replaced by directional components. 

For a subset of nodes, A C V, in a network of |V| = n, we define 1a as an indicator 
vector of length n with each element 1a(z) — I( v i ^ A). Recall that Sk and represent 
the source part and terminal part of the fc-th directional component respectively. We have, 

Theorem 3 The principal singular value of Q in a directed network is one and its mul- 
tiplicity, K , equals to the number of directional components in the network. In addition, 

ii 

the principal left (or right) singular vector space is spanned by {D? lsi 5 • • • ? ls K } ( or 

II 
{Dl l Tl , • • • , Di 1 Tk }) respectively. 



2. More details of this dataset will be provided in Section [6] 
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The complete proof is included in Appendix |A.1| The main idea of the proof is behind 

the connection between the spectrum of a directed network and that of the bipartite graph 

i 

generated from it. The nonzero entries of each ls k correspond to the source part of one 

i 

directional component and similar result holds for {Dc lT k }fc=l,...,K an d terminal part. 

Theorem [3] indicates that the space spanned by the top singular vectors contains almost 
complete information about all directional components. However, the output space calcu- 
lated from any SVD algorithms is up to a rotation due to the multiplicity of the top singular 



value. The DI-SIM algorithm (Rohe and Yu, 2012) address this problem by applying the 



k-means algorithm on the left and right singular vectors separately. In addition, even with 
the presence of a small fraction of external edges between directional components, the space 
spanned by top singular vectors should still be stable when a large eigen-gap exists. 

Although it seems attractive to apply the DI-SIM algorithm to find directional com- 
ponents in the presence of external edges, a few limitations of the algorithm make it less 
effective in practice, especially for large networks. First, the two sets of clustering results 
from k-means on left and right singular vectors are not paired and it is not obvious how 
one may connect these two parts of the solution. Second, the DI-SIM algorithm needs to 
pre-specify the number of communities, which is unknown in practice. If the pre-specified 
value is too small, the leading spectrum of SVD will be unstable. As a result, the commu- 
nity detection results will be less accurate. More importantly, the computation of DI-SIM 
is very demanding for large networks with many communities, since it tries to detect all 
communities at the same time. SVD computation increases quadratically as the number of 
singular vectors increases and the k-means algorithm slows down dramatically when either 
the number of nodes or the number of groups is large. 

In response to these limitations, we consider direct-searching algorithms to identify one 
community at a time rather than attempting to discover all components by the division of 
nodes. This strategy leads to algorithms that do not need pre-specification of the number 
of communities and in turn provides stable results and scalable computation. 

3.2 Regularized SVD with Lq penalty 

To design methods that extract directional components from a network one at a time, we 
start with defining the size of a community C(S,T) that consists a source part S and 
terminal part T. In directed networks, such as citation network or social network, the 
distribution of in-degrees is usually very different from that of out-degrees. Therefore, it is 
nature to treat S and T differently. We define the size of a community (7(5, T) as 

SZ„(C) = SZ U (S,T) = \S\ +lj\T\, (3) 

where the constant uj > balances the in-degree and the out-degree. 

With this definition, we may order all directional components of a directed network in 

an increasing order of their sizes such that SZ u (DCi) < SZ^DC^) < • • • < SZ u (DCk)- 

I 1 

Since each directional component corresponds to a pair of vectors (D? ls k ? 2 lT k ) with the 
largest singular value 1, we should be able to identify the smallest directional component 
by penalizing the SVD by the size of the component corresponding to non-zero elements of 
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the singular vectors. This is achieved by adding Lq penalty to vectors u and v in SVD as 
max u*Qv - r/(||u||o + cj||v||o), ||u|| 2 < 1, ||v|| 2 < 1. (4) 

u,v 

The solution (u, v) from ^ leads to a community C(S, T) with S = {v : u(v) ^ 0} and 
T — {v : v(v) 7^ 0}. The following proposition connects the smallest directional component 
with the solution of the Lq regularized SVD. 



Proposition 4 There exists a cutoff value e > such that for any < 77 < e ; the community 
defined by the solution (u, v) of the optimization problem is the same as the smallest 
directional component in the directed network. 

With its proof given in Appendix |A.2[ the proposition implies that one of the smallest direc- 
tional components can be found by solving the optimization problem Q with a sufficiently 
small 77. Details in the proof also reveal that the optimization essentially searches for one 
of the smallest sub-matrices of Q that has singular value 1 in this case. 

We provide a simple example to illustrate the ideas behind this approach. The left panel 
of Figure [3a] shows the adjacency matrix of a network with ten nodes. The network has 
two directional components with different sizes. The parameter uo in defining the size of 
communities is set to 1.1 in ([5]). We randomly select two subsets (S, T) of nodes to generate 
a sub-matrix Q S (S,T) from the full graph Laplacian matrix Q, considering S,T as indexes 
of rows and columns of Q respectively. For each selected Q S (S, T), we calculate its principal 
singular value \i(Q s (S,T)) and its size SZ UJ (S,T). In addition to 500 randomly chosen 
sub-matrices, we also include those two directional components as testing communities. 



The right panel of Figure 3a show paired values (SZ CjJ (S^ T), \i(Q s (S, T))), with o, for 
each sampled sub-matrix Q S1 and those two directional components are marked as Xs. Let 
us denote the value of objective function in ([I]) as C. This figure shows that there exist 
a line with slope 77 > whose intercept C is maximized at the point corresponding to the 
smallest directional component as Proposition [4] describes. Therefore, the optimization Q 
leads to identification of the smaller of the two directional components in this network. To 
summarize the result, both Theorem [3] and the example show that directional components, 
if there is any, can be identified sequentially by Lq regularized SVD approach. 

Recall that we encountered the problem that the existence of a small number of external 
edges tends to connect smaller directional components together as one giant directional com- 
ponent. The root of the problem is the too strict requirement on finding exact directional 
components, the maximal set of node satisfying D-connectivity. The Lq regularized SVD 
limits the number of non-zero entries of the singular vectors, so it may find a component 
that is smaller and almost separated from other components. To illustrate the advantage 
of the regularized approach, we randomly add three external edges in the example. As a 
result, those two directional components merge together as one, as shown in the left panel 



of Figure 3b, The right panel of Figure 3b plots paired values (SZ U (S : T), Xi(Q s (S, T))) of 



the same 500 pairs of (S,T) shown in Figure 3a The principle singular values of two true 



directional components (X marks) have decreased because of the added external edges, but 
the line with the same slope is still capable to identify the smallest directional component. 



9 




(a) No external edges 




(b) After adding three external edges 

Figure 3: Left panel of (a): The adjacency matrix of an example network having two 
directional components. Right panel of (a): Scatter plots of SZ U0 {S 1 T) and \i{Q s {S, T)) 
of sub-matrix Q S {S 1 T) of the graph Laplacian matrix Q of a directed graph having two 
directional components. Left panel of (b): The adjacency matrix of the example network of 
Figure [3a| after adding three external edges. Right panel of (b): Scatter plots of SZ UJ {S^T) 
and \i{Q s {S 1 T)) of sub-matrix Q S {S 1 T) of the graph Laplacian matrix Q of the directed 
graph corrupted by external edges. 



3.2.1 L Regularized SVD Algorithm 



In this section, we show the solution the of Lq regularized S VD Q can be fou nd through 
iterative hard-thresholding. Similar with approaches taken in Shen and Huang (2008) and 
d'Aspremont et al. (2008), we start with exploiting the bi-linearity of the optimization 
problem Q. For a fixed vector v, we show how to solve the maximization problem with 
respect to u. Here we first introduce a notation: 
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Definition 5 Given a vector z = (z\, . . . , z n ) f E lZ n , we denote the l-th largest absolute 
value of z as \z\^y Consequently, we define z^(g 1Z n ) as the vector resulted from hard 
thresholding z by its (I + l)-th largest absolute entry, e.g. the i-th element of 

zf(z) = Zil(\zi\ > \z\ (i+1) ) = Zil(\zi\ > \z\ w ), 

while the superscript "h" stands for hard-thresholding. 

For a fixed v, we may treat Qv as a generic vector z and find the solution u that maximizes 
Q through the following result: 

Theorem 6 For a given vector z and a fixed constant p > 0, the solution of 

max u*z — p||u||o (5) 

||u|| 2 <l 

is 

u = zf/||zf|| 2 , 

where the integer I is the minimum number that satisfies 

\z\(l+i) < ^P 2 + 2p\\z l ?\\ 2 (6) 

Proof When ||u||o = max u u^z is obtained when u = z^/||z^||2. Thus the objective 
function ^ can be written as 

F(l) = \\4\\ 2 -pl. 

Now we try to maximize F(l) over /. Notice that ||z^||2 monotonically increases as / in- 
creases. The value of F{l) keeps increasing until 

sjH\\l + \Al +l) -\\^ih<p, 

which is equivalent to ([6]). After / goes beyond this point, since decreases and ||z^||2 
increases, F(l) starts to decrease and keep decreasing. Therefore, The solution to ^ is 
found at the minimum / that satisfies We note that when vector z contains tied 

absolute values, one may just select the first ones following the original order. ■ 

Theorem [6] leads a computationally efficient algorithm to solve the Lq regularized SVD in 
We first sort the entries of z by their absolute values and then sequentially search from the 
largest to smallest while testing if condition ^ has been met. As soon as the ^ is satisfied, 
we obtain the solution. The computation complexity of this direct searching algorithm is 
0(n log (n)). Consequently, the solution of regularized SVD problem Q is obtained by using 
the searching algorithm for a fixed v and for a fixed u alternatively. Algorithm [I] lists the 
details. 
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Algorithm 1 Lq regularized SVD 



Require: Q,t],uj 
initialize v 
repeat 

z <- Qv , p <- 7] 

u «— z^/Hz^l^ , where / is the minimum integer such that |^|(/+i) < \j p 2 + 2 p ijz^jj^ 
z ^— Q^u , p ^— 77 

v ^— z^/Hz^l^, , where / is the minimum integer such that |^|(/+i) < ^ p 2 + 2 p ||z^||2 
until u, v converged 
return u, v 



3.3 SVD with Elastic-net Penalty 



In the previous section, we find that the Lq regularized SVD may detect smaller and tighter 
communities in direct networks and it can be solved by an efficient algorithm based on the 
power method combined with hard thresholding. When the number of external edges is 
relatively small, we found the Lq regularized SVD performs well. However, when external 
edges introduce larger perturbation to the spectrum of Q, it may lead to difficulties for Q 
to identify the communities, for example, the line y = rfx + C may hit other o's before 
finding the second X in the right panel of Figure |3b} 

We may consider a slight modification of Q in the following form: 



max vlQw 

u,v 



subject to ||u||o + k>||v||o < 7, 



u 



< 1, 



< 1, 



(7) 



which searches for a sub-matrix of Q that has the largest singular value with a strict 



constraint on its size. The yellow vertical line in the right panel of Figure [3b] shows the 
constraint when 7 = 13. In this case, the solution of Q corresponds to the second direc- 
tional component. This formulation provides another option for recovering the directional 
components when extra edges present. However, finding a solution of Q is challenging due 
to the discrete nature of the constraint. 

A typical approach to overcome the computational difficulty related with the non- 
convexity of the Lq constraint is to relax Lq penalty to L\ penalty. Replacing Lq penalty 
to L\ penalty and separating the penalty for u and v result in 



maxu^Qv subject to ||u||i < Ci, ||v||i < C2, 1 1 u. 1 1 2 < 1, 1 1 v 1 1 2 < 1- 



(8) 



This optimization problem is identical to one version of the sparse matrix decomposition 
method proposed in Witten et al. (2009), in which the authors provided an algorithm to 
find a local solution. The algorithm uses the power method for SVD combined with soft- 



thresholding on singular vectors over iterations QShen and Huang , 2008; Witten et al. , 2009 
Lee et al. , 2010). However we found that the solution of (||) did not report significantly 
better solutions than Lq regularized SVD solution from ^ in our simulation studies. One 
of possible reason is that L\ constraint may fail to give sufficiently sparse solution of u and 
v, as pointed out by Yang et al. ( 2011[ ). 
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As an alternative, we propose SVD with Elastic-net type of penalty (Zou and Hastie 
2005]), 



maxu^v, 

u,v 

subject to (1 — ce) || u|| I + °<ll u lli < c i? (1 



(9) 



v||| + /3||v||i <c 2 , 



where the sparsity level is controlled by parameters a E [0,1) and /3 E [0,1). Note that 
a = (3 = leads to the regular SVD problem. When a E (0, 1), /? E (0, 1), the optimization 
problem becomes non-convex. We show that local optimal solutions of Q can be found by 
the power method with soft-thresholding. 



3.3.1 Elastic-net Regularized SVD Algorithm 

Similar to the calculation of the Lq regularized SVD, we take advantage of the bi-linearity 
of the optimization problem. For fixed v and a, (or u and /3), the optimization becomes 
convex, 

maxu^Qv, subject to (1 — ce)||u||| + c^||u.||i < ci. (10) 

u,v 

whose global solution can be obtained through simple soft-thresholding. We note that 



constraints. To find the solution of (10), we first introduce a notation: 



Witten et al. (2009) and Lee et al. (20101) showed similar results under slightly different 



Definition 7 For a vector z = (zi, . . . , z n ) f E 1Z n , recall the l-th largest absolute value of z 
was defined as \z\yy Denoting = for connivence and we define 



G z (x) 



k(x)-l 

h £ (i« 



4x 2 vi— ico "^ 2 + ^ 2^ 
i=l i=l 



k(x)-l 

h £ (N 



(11) 



where k(x) E {1, . . . , n + 1} satisfies \z\^ x y < x < \z\(k(x)-i)- 



From Witten et al. (2009) we borrow a notation, 5(z, d), as the result of soft-thresholding 
a vector z by a scaler d. Soft-thresholding is defined by S(z, d) — sign(z)(|z| — where 
d > and x + = max{x,0}. Again treating Qv as a generic vector z, we find the solution 
u that maximizes (p~0|) by the following result: 



Theorem 8 For a fixed vector z, the solution of the optimization problem, 
max i^z, subject to (1 — a)||u||2 + <^|| u l|i ^ c i 



is 



2d(l ~ a) Q( . 
u — S(z, a), 

a 



and the threshold level d is the solution of G z {d) = c\(l — a) /a 2 . 
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Algorithm 2 SVD with elastic-net penalty 



Require: Q, a, (3, ci, C2 
initialize v 
repeat 

d ^— the solution x of Gq v (x) = c\(l — a) /a 2 

u ^2^-o) 5(QVjd) 

d ^— the solution x of Gqt u (x) = 02(1 — /3)//3 2 
v ^MH) 5( Qt M) 

until u, v are converged 
return u, v 



The proof of the theorem is provided in the Appendix |A.4 and its first part resembles the 
proof of Lemma 2.2 of Witten et al. (2009). This theorem leads to Algorithm [2j The com- 
putation in Algorithm [2] involves solving for the soft threshold level d in equation G z (d) = c, 
where c is some constant in the range of the function G z (-). The solution d is described in 
Lemma 



Lemma 9 The solution of the equation G z (d) = c for given c > is 



Et 



1(0 



4c + k 



(12) 



where k is a positive integer in {1, 2, . . . , n} that satisfies G z 



!(*+!)< 



> c. 



The proof of Lemma [9] is detailed in Appendix A.5| along with a simple algorithm for finding 
k in the Lemma. Our contribution is that we seek the threshold level d in the linear time 
that is proportional to the number of non-zero entries of the solutions, which makes the 
computation feasible for large matrix in comparison to the binary search method proposed 



m 



Witten et al. (2009). Even though Q is nonnegative matrix in our problem, the linear 
search method can be applied to any real valued matrix. Interestingly, ([§]) can also be solved 
using the linear search method instead of the binary search method. We verified that the 
linear search method is faster than the binary search method by 3 to 20 times when the 
number of nodes in the network is between 10 3 and 10 7 . 

In summary, we propose two linearly scalable algorithms, the Lq regularized SVD and 
the Elastic-net regularized SVD, for extracting one community from a directed graph. In the 
next section, we will apply these community extraction algorithms repeatedly to a network 
and try to harvest all tight communities sequentially. 



4. Community Harvesting Through Regularized SVDs 

Previous results suggest that the regularized SVD algorithms are capable of recovering a 
community like directional component under the presence of noise edges, when the penal- 
ization parameters, 77 in Q or (a, (3) in Q, are properly chosen. In addition to specifying 
the algorithm parameters, one also needs to provide a starting vector v or u to initialize 
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the algorithm. In this section, we first discuss the effect of these parameter specifications 
and how to choose them in practice. Then we propose a community harvesting scheme 
that repeatedly use the regularized SVD algorithm to extract all communities that are 
approximate directional components. 



4.1 Parameter Selection and Initialization for Regularized SVDs 

We now concentrate on studying the effect of the penalization parameters on the algorithm 
outputs and proposing practical rules on how to specify them. We use the Elastic-net 
regularized SVD in this section, but the proposed parameter-selection procedure generalizes 
to the Lq regularized SVD easily. For Elastic- net regularized SVD, we first point out that 



the parameters c\ and C2 in (10) can be set to one as default values, since they only affect 
the magnitude of the solution vectors. Theorem [8] suggests that, for any change in c\ > 0, 
the value of a can be modified to produce the same value of c\{l — a) /a 2 since the range of 
(1 — a) /a 2 is (0, oo]. One may concern the positive multiplicative change in u caused by the 
change of a. However, the multiplicative change in u only leads to the same multiplicative 
change in v in the following optimization. All multiplicative changes over the optimizations 
do not change the sets of nodes, S = {v : u(v) ^ 0} and T = {v : v(v) ^ 0}. 

One property of the regularized Elastic-net SVD algorithm is that one may obtain 
solutions that change smoothly over small changes of penalization parameters (a, (3). The 
trick is to use the solution v* (or u*) at the current sparsity for the initial vector of the 
optimization problem of the next contiguous sparsity level. The small change in the sparsity 
levels allows the algorithm converges within few iterations without dramatic changes in 
u*,v*. This strategy of setting initial vectors of Algorithm [2] allows us to construct the 
entire solution path on a range of sparsity levels efficiently. 

We present an example that shows snap-shots of the solutions corresponding to a few dif- 
ferent penalization parameters on a solution path. A network of size 60 with 3 communities, 



A, B, and C with 20 nodes in each, is simulated from the stochastic block model (Holland 



et al. , 1983[ ). For the three groups, the probability of existing a directed edge between 



any ordered pair of nodes, (v s ,v t ), is set to 0.3 if v s E A,v t E B or v s E B,vt E A or 
v s E C,Vt E C and the probability is set to 0.05 otherwise. A realization of the network 
is shown in the form of adjacency matrix in Figure [4j We observe three strong blocks of 
dense connections when the order of nodes follows the true grouping (A, B and C from left 
to right). 

A solution path from the Elastic-net regularized SVD is generated with an initialization 
vector v = l{ Vl y and parameters a = f3 decreasing from 0.8 to 0.1 by a step size of 0.05. 
The panels in Figure [4] show the extracted communities (in red dots) at four different 
sparsity levels a = /3 = 0.6,0.4,0.15,0.1 on the path. As we expected, it is obvious to 
observe the nested structures among the detected communities on the solution path. At a 
balanced point, we observe that the algorithm captures the most of links in (5, T) = (£> , A) 
at sparsity level 0.15 while not including too many external edges. Therefore, we achieve 
good community detection if we are able to select a good penalization level. 

We propose to use a directed Normalized-cut (d-Ncut) criterion to select proper penal- 
ization parameters on the solution path. For two sets of nodes A^i,A^, the cut of edges 
starting from Ni and ended at N2 is Cut(iVi, N2) = X^eiVi jeN 2 Wij- Given a detected pair 
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Figure 4: A simulated directed graph from stochastic block model. Within probability 
0.3, between probability 0.05, 20 nodes for each block. By decreasing sparsity level, more 
significantly related links join to (S,T). 

of source set and terminal set (£, T), the d-Ncut criterion is defined as 

H iwr<? T\ 1 f Cut (^ f ) , Cut(g,r) Cut(5,r) Cut(5,f) ) 
d-Ncut(S,T) = - 1^^- + + + j + 

Cut(5,T) | ; 1 I +Cut(5,f) ( . 1 _ - ; 1 _ I , 

where S and T are the compliment nodes sets of S and T respectively. The d-Ncut criterion 
measures how much (5, T) and (S,T) are close to directional components. The detailed 
derivation of d-Ncut is presented in Appendix [Bj The d-Ncut value greater than 1 indicates 
no community structure in T) while a smaller positive d-Ncut value indicates a stronger 
community structure, for example, directional components have zero d-Ncut value. The d- 
Ncut values of the four different results shown in Figure |i] are 0.653, 0.586, 0.398 and 0.425, 
which correspond to the penalization parameter at 0.6, 0.4, 0.15, and 0.1. The minimum 
d-Ncut value on the whole solution path is actually 0.398, which leads us to pick the results 
at sparsity levels a = (3 = 0.15. 

Another interesting observation made in this example is the dependence between the 
extracted community and the initialization vector. Since both the Elastic-net regularized 
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SVD algorithm and the Lq regularized SVD algorithm are based on local updating, their 
outputs are sensitive to the initial vector vq. In the example, the algorithm would have 
recovered another collection of links in (5, T) = (A,B) if it started from vo = 1{^ 30 }- 
Depending on the purpose of the analysis, one may start with a constant vector of vo and 
increase the sparsity levels to recover a rather big d-comp (Top-down), or start with an 
indicator vector of vo and decrease the sparsity levels to recover a d-comp close to the 
initial node (Bottom-up). 

Coupling the d-Ncut criterion with the solution path generating method based on the 
regularized SVDs, we arrive at a practical algorithm that extracts one community from the 
network. We name the identified community (£, T) from this algorithm as a Approximated 
Directional Component (ADC), to distinguish it from directional components. The steps of 
the algorithm are described in Algorithm [3j We note that one may replace the Elastic-net 
regularized SVD with the Lq regularized version easily. 



Algorithm 3 Community Extraction though Elastic-net Regularized SVD 

Require: Q, initialization vector vo, grid of sparsity levels {(ai, 
initialize v ^— vo 
for i = 1 —> I do 

Obtain u*,v* by running Algorithm [2] with parameters (Q, ol\, A, 1, 1) and initial- 
ization v. 

S = {v : u*(v) + 0} and T = {v : v*(v) + 0} 
d-Ncut; <- d-Ncut (S,T) 
(S\T { ) <- (S,T), v* 
end for 

return S = S* J and T = T J , where j = argmin^ d-Ncut^ 



The algorithm requires an initialization vector vo and a grid for the sparsity level pa- 
rameters from users. The initialization vector vo can be set as with a randomly picked 
Vi with nonzero degree or simply picking the node with the largest in-degree. The choice of 
grid of sparsity levels is critical to discover sensible ADCs but it is not trivial to construct. 
We provide general guide lines for designing a grid. 

• Make ||(o^+i, A+l) — small enough to obtain gradual change of (u*,v*). 

• The default ratio of a and f3 is one, which provides similar ratio of source parts and 
terminal parts to the ratio of the whole network. One may provide different ratio if 
there is prior knowledge of unbalanced communities. 

• The choice of grid should reflect the balance between the computational burden and 
the quality of communities. 

In practice, one can check the d-Ncut values of several solution paths with a fine grid to 
obtain rough idea of the reasonable range and fineness of the grid for the dataset at hand. 
We will revisit this issue in more detail in Section [6l 
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4.2 Community Harvesting Algorithm 

In order to identify all communities in a directed network, we propose to apply Algorithm [3] 
repeatedly through a community harvesting scheme. The idea of community extraction has 
been discussed in Zhao et al. (2011), in which a modularity based method are proposed. 
Unfortunately, this class of methods based on modularity is NP hard to solve. Starting with 
the graph Laplacian matrix Q of the full network, we first apply Algorithm [3] to identify a 
ADC{S 1 T). Then all entries in Q that corresponds to (S, T) are set to zero and we reapply 
the same algorithm to the reduced Q matrix with a different initialization to identify the 
next ADC. Procedure continues until Q has no significant ADCs. Typically, indication of 
no more significant ADC appears as a continued sequence of small ADCs with very few 
nodes (3^4). We call this procedure harvesting since it extracts one component from the 
network at a time. Algorithm [4] summarizes the harvesting algorithm. 



Algorithm 4 Community Harvesting Algorithm 

Require: Q, i = 1 
repeat 

Obtain S, T using Algorithm [5] with Q and (vi is a randomly chosen positive 

degree node of Q) 
Q(l 5 ,lr) <-0 

i <- i + 1 
until Q has no significant ADCs 
return {(Sj.Tj)}^^ 



We point out that all harvested ADCs results in a partition of all edges. The edges of 
directional components are likely to belong meaningfully large ADCs while external edges 
are likely to compose trivial ADCs that have only few edges. Thus, we recommend to ignore 
too small ADCs in the context of communities desired. As for the nodes, the harvesting 
algorithm does not necessarily result in a partition of source nodes (terminal nodes), since 
the source parts (terminal parts) of different ADCs may have overlaps. However, if a graph 
has strong community structure, source parts and terminal parts of meaningfully large 
ADCs only have negligible overlaps. 

We also want to stress that the harvesting algorithm takes different approach from the 
ones in sparse SVD algorithms for obtaining multiple sparse singular vectors. Wi t ten et al. 



(2009) and Lee et al. (2010) use the residual matrix, Q — suv f where s is pseudo singular 
value, to obtain multiple sparse singular vectors. This approach does not fit to our purpose 
because only the principal singular vector of a sub-matrix is required for a directional 
component. In addition, harvesting algorithm increases the sparsity of Q whereas the other 
approach makes the residual matrix more dense. 



5. Simulation Studies 

In this section, we evaluate the performance of the two harvesting algorithms, Lg-harvesting 
and EN- harvesting under various settings of community structure. In addition to the 
harvesting algorithms, the DI-SIM algorithm is included for the sake of comparison. We 
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find that, in addition to the proportion of external edges, the average degree and the size 
of communities are also important factors determining community detection accuracies. 

To generate networks with different types of community structures, we follow a bench- 
mark model proposed by Lancichinetti et al. (2008), referred as the LFR model. The LFR 
model is based on a restricted version of the stochastic block model where each node has 
a probability of being connected to nodes in the same community and another probability 
of being connected to nodes in other communities. This model is originally developed for 
undirected networks but it has been extended to directed networks by |Lancichinetti and 



Fortunato (2009b). The LFR model introduces heterogeneous distributions of degree and 
community sizes. The out-degrees of all nodes are almost constant while the in-degrees 



follow a power law distribution. This model is more realistic than GN benchmark of|Girvan 



and Newman ([2002) in applications like citation networks and online social networks. 

In our study, we generate networks from the LFR model with n — 1000 nodes, whose 
in-degrees follow a power law (with decay rate t\ — —2) with maximum at k max = 50. The 
sizes of the communities in each network are assumed to follow a power law with a decay 
rate T2 = — 1 and the sizes of source part and terminal part are the same. We vary three 
sets of parameters of LFR model to control different aspects of the simulated networks: 



• Range of community sizes is set at two levels through a pair of parameters 
(5^=1 (C) mm , S^=i(C)max)- 

(40, 200) for big communities and (20, 100) for small communities; 

• Average degrees (in-degree and out-degree) k for all nodes are set at three levels: 
{5, 10, 20} for sparse, median and dense networks; 

• Proportion of external edges \± for all nodes is set at three levels: 
{0.05,0.2,0.4}. 



Before providing the details of simulation results, we show an example of the simulated 
network and results of the three community detection methods in Figure [5j This network 
with big communities, (S^=i(C) mm , SZ u;= i(C)max.) = (40,200), is generated with pa- 
rameters k = 20 and \i — 0.1. Rows of matrix correspond to source nodes and columns 
correspond to terminal nodes while each dot in the plot represents an edge. The top left 
panel is the adjacency matrix of the simulated network. The rest of panels present commu- 
nity structures found by the three algorithms in comparison. By the design of the DI-SIM 
algorithm, it provides two unrelated partitions for rows and columns. In contrast, the har- 
vesting algorithms recover directional components by collecting edges of each component 
and they indeed showed almost perfect recovery in this example. 



5.1 Simulation Results 

Back to the full simulation, we measure the accuracy of community detection results by a 
mutual information based criterion that was proposed by Lancichinetti et al. (2009) and 



used for the comparison of various community detection algorithms done by Lancichinetti 



and Fortunato (2009a). One advantage of this criterion is its ability to handle overlapping 



communities, see details in the appendix of Lancichinetti et al. (2009). As like other mutual 
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Original DI-SIM 




Figure 5: Example random matrix generated by LFR benchmark and the results of DI- 
SIM algorithm and harvesting algorithms (top right: DI-SIM, bottom left: SVD-LO, bottom 
right SVD-EN). 



information based criterion, the accuracy measure has maximum value one for perfect match 
and has minimum value zero for community assignment that is independent of the truth. 

The DI-SIM algorithm results in two different partitions of nodes, thus we calculate 
their accuracies separately with respect to the true partitions and average them. In order 
to compare harvesting algorithms with the DI-SIM algorithm, we calculate the accuracies of 
harvesting algorithms based on source parts and terminal parts instead of computing based 
on edge. The source parts and terminal parts of the harvested ADCs are used to measure 
the accuracies separately and again the accuracies are averaged. 

When applying the DI-SIM algorithm, we assume the true number of communities Nq 
is known. We compute the first Nq singular vectors of Q and apply the k-means algorithm 
with Nc clusters on the left and right singular vectors separately. One hundred random 
initialization for k-means algorithm is applied and the result minimizing the within-cluster 
sums of point-to-cluster-centroid distances is reported as the final result. 

For harvesting algorithms, the bottom-up strategy is used accompanied with vo being 
the node of largest in-degree at each harvesting. The sparsity levels for source part and 
terminal part are set to the same value, A = 1 in Q and a — (3 in Q, and the grids of 
them are determined without using the knowledge of the size of communities. The grid of 
sparsity levels for Lq penalty, 77, contains 100 equally spaced points in a range[10 -2 , 10 -5 ] 
and the grid of sparsity levels for EN penalty, a, includes 100 equally spaced points in a 
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range [0.1,0.98]. The end-points of the grid are selected to cover ADCs sized from two 
nodes to about the half of nodes in random networks. 

In this simulation, we generate ten random networks under each of the eighteen (2x3x3) 
parameter combinations and the average of mutual information based accuracy of each 
algorithm is reported. The results for networks with large communities and those with 
small communities are reported in Figure [6] and Figure [7] respectively. In these figures, each 
of the nine panels on the left side visualizes one of the ten generated networks, while the bar 
chart on the right side shows the corresponding accuracies from three different algorithms. 
Recall that the range of the accuracy measure is [0, 1] and the larger the value, the better 
the accuracy. 

The results for big communities in Figure [6] show that the harvesting algorithms report 
almost perfect community detection when nodes has average degree of 10 or 20 and fi is 0.2 
or 0.05. The networks with such average degree and fi correspond to the strong commu- 
nity structure that ensures D-connectivity of the nodes in true directional components and 
relatively small fraction of external edges. The £?7V-harvesting shows better performance 
than the Lo-harvesting in the region of strong community structure. Moreover, the EN- 
harvesting gives almost the perfect result in the setting of \± — 0.4 and average degree 20. 
As we mentioned in Figure [5], the DI-SIM algorithm fails to give a perfect result even for the 
high average degrees. However, the DI-SIM algorithm gives better results than harvesting 
algorithms in the region of relatively weak community structures. 

The accuracies for detecting small communities change slightly from the ones of big 
communities (Figure [7]). The accuracies of the other two methods have decreased for the 
region of strong community structures. The k-means algorithm in DI-SIM algorithm seems 
to be less accurate for the larger number of clusters in the setting of small communities. 
However, the accuracy of the i?Af-harvesting algorithm with elastic-net penalty is similar 
to the results of the setting of big communities. 

Closer investigation revealed that sources of the loss in accuracies are quite different for 
the harvesting algorithms and the DI-SIM algorithm. The loss of accuracies of the DI-SIM 
algorithm mainly came from some clusters dividing true communities. Unfortunately, it is 
not straight forward to come up with further improvement of such clusters. In contrast, 
the loss of accuracies of harvesting algorithms mostly come from several ADCs merging 
true communities. In such a case, it is possible to be further improved by rather simple 
post-processing, for example, applying a harvesting algorithm one more time on the ADC 
with finer grid of sparsity levels. 

In our experiment, we also find that the performance of harvesting algorithms is as 



good as that of Infomap, which shows the best performance in the report of Lancichinetti 



and Fortunato (2009a). However, the performance of Infomap grounds on the assumption 
that the true communities have the same source part and terminal part S = T and the 
performance can be dramatically dropped without the assumption. In contrast, harvesting 
algorithms do not require the assumption on the true communities since the source part 
and the terminal part of a directional component may be totally different. 
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Networks with Big Communities 



Community Detection Accuracy 




Average degree Average degree 



(a) Adjacency matrices of networks with big communities. Rows and 

columns are arranged by the true communities. ( b ) Community detection accuracy of three tested algorithms. 

Figure 6: Accuracies of three algorithms, DI-SIM, Lo-harvesting and TV-harvesting, in nine different settings of the community 
structure. The x-axis indicates average degree and the y-axis indicates the proportion of external edges. The left panel shows an 
example network at each setting. The accuracies are displayed as bar charts in the right panel. The size of communities ranges 
in 40 - 200. 



Networks with Small Communities 



Community Detection Accuracy 
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(a) Adjacency matrices of networks with small communities. Rows 
and columns are arranged by the true communities. 
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(b) Community detection accuracy of the three algorithms. 



Figure 7: Accuracies of three algorithms, DI-SIM, Lo-harvesting and EW-harvesting, in nine different settings of the community 
structure. The x-axis indicates average degree and the y-axis indicates the proportion of external edges. The left panel shows an 
example network at each setting. The accuracies are displayed as bar charts in the right panel. The size of communities ranges 
in 20 - 100. 



5.2 Computation Time 



One driving motivation of this paper is the scalability of community detection algorithms 
on large or massive networks. In addition to the study of accuracies, we investigate the 
computation requirements of different algorithms in this simulation study. The algorithms 



are implemented in MATLAB on a linux machine (2x Six Core Xeon X5650 / 2.66GHz / 
48GB). 

We gradually increase the number of nodes in networks, 10 3 ,5 x 10 3 ,10 4 ,10 5 , to see 
the scalability of the three algorithms in comparison. Networks are generated from the 
LFR benchmark with the same settings of big communities (size ranged from 40 to 200) 
with n — 0.2 and k = 20. Upon the fixed range of the size of communities, the number of 
communities in the network linearly increase as the number of nodes increases. 



The harvesting algorithms were performed in the same way as described in Section 5.1 
except that the harvesting algorithms run until they find ADCs as many as the number of 
true communities. The computation time for the DI-SIM algorithm is split into two parts, 
one for singular value decomposition and the other for the k-means algorithm with 100 
random initialization. 

Among the three algorithms, the Lo-harvesting algorithm is the fastest for large net- 
works (Table [I]). The TV-harvesting algorithm takes roughly two or three times of more 
computation, which is consistent with the observation that the i?7V-harvesting does more 
extensive search of candidate ADCs. In response to more computation time, the accuracy 
of TV-harvesting method was significantly higher than that of Lo-harvesting for all cases; 
0.999 for EW-harvesting and 0.88 ~ 0.92 for Lo-harvesting. Compared to harvesting algo- 
rithms, the DI-SIM algorithm does not scale well to large networks with many communities. 



\N\ 


# of communities 


L 


EN 


DI-SIM 


10 3 


19 


16.9 


43.8 


0.08 + 4.7 


5 x 10 3 


103 


72 


318 


3 + 809.5 


10 4 


201 


327 


807 


14 + 46 x 100 « 4.6 x 10 4 


10 5 


1986 


41,400 


86,160 


10,427 + 38,880 x 100 « 4.0 x 10 6 



Table 1: Computation times (in seconds). Four sample networks are generated for four 
different numbers of nodes with the common settings \± — 0.2 and average degree 20. 



6. Applications 

In this section, we apply the proposed harvesting algorithms to two highly asymmetric 
directed networks, a paper citation network and a social network. Paper citation networks 
are highly asymmetric directed networks because of their temporal structure; a paper can 
cite only existing papers. The social network used in this application is highly asymmetric 
due to a small fraction of popular users with a high fraction of total in-degrees. We show 
that harvesting algorithms can capture the communities reflecting two different roles of 
nodes even in such highly asymmetric directed networks. 

Before reporting details of the experiments, we first address the issue of deciding the 
range of sparsity levels for the harvesting algorithms in practice. Since we have limited 
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knowledge on the size of communities, a wide range of sparsity levels is typically necessary. 
However, we observed that given the wide range of sparsity levels we may end up with a 
small number of big communities. In order to capture small but meaningful communities, 
we propose to stop Algorithm [3] early if the d-Ncut value reaches a local minimum in the 
grid of sparsity levels. A simple implementation used in this experiment is to stop searching 
for candidate ADCs if the d-Ncut value of the current candidate-ADC bounces up to higher 
than s p times of the minimum d-Ncut value of the previously detected candidate- ADCs. In 
addition, we pre-specify a bound si on the desired d-Ncut value so we only stop early at 
communities that report a d-Ncut value lower than S[. 

6.1 Cora Citation Network 

We first apply both harvesting algorithms to the Cora citation network, a directed network 
formed by citations among Computer Science (CS) research papers^] In this experiment, we 
use a subset of the papers that have been manually assigned into categories that represent 
10 major fields in computer science, which is further divided into 70 sub-fields. This leads 
to a network of 30,228 nodes and 110,654 edges after removing self-edges. In this citation 
network, only 5.4% of edges are symmetric. The average degree is 3.66, which is relatively 
low. We also found that 2345 nodes had error labels and they were put into 11th category. 

The algorithms are applied with a bottom-up approach such that the algorithms start at 
the terminal nodes with the largest in-degree among un-harvested nodes at each harvesting 
run. The sparsity levels parameter 77 in the Lg-harvesting takes values decreasingly in a grid 
{exp(— k) : k = 10 + z(12/200), i = 1, . . . , 200}. Similarly, the sparsity levels parameter a in 
the i?iV-harvesting takes values decreasingly in a grid { i+exp(fc) 

: k = -l + z(ll/200),z = 

1, . . . , 200}. The nonlinear decreasing setup helps obtain gradual changes of the size of the 
candidate- ADCs at low sparsity levels. Early stopping parameters are set to s p = 1.5 and 
Sl = 0.3. 

For both harvesting algorithms, we first provide a summary of the first twenty ADCs 
discovered. We name the ADC obtained in the Lg-harvesting ADC L ° and the ones obtained 
by the £" TV-harvesting ADC EN . We report the sizes of source part and terminal part, the 
number of edges and d-Ncut value for each ADC in Table [2| Out of total 110,654 edges, the 
first twenty ADC L °s cover 107,493 edges and the first twenty ADC EN s cover 95,529 edges. 
We observe that larger communities are likely to be captured in the first several ADCs 
because the initial value vq is more likely to be a member of large communities. Overall, 
we find ADC L °s are better than ADC EN s based on the comparison of d-Ncut values. This 
result is consistent with the result of simulations in Section [5] that Lo-harvesting performs 
better in networks of low average- degrees. 

6.1.1 Comparison to DI-SIM and Infomap 

To evaluate the performance of harvesting algorithms, two existing community detection 
algorithms are also applied for comparisons. First, the DI-SIM algorithm (iRohe and Yu 



2012) is applied, assuming the number of communities equals to the number of major- 



fields in CS, which is ten. For the k-means step of the DI-SIM algorithm, ten random 



3. http: //people . cs .umass . edu/~mccallum/data. html 
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(a) First 20 ADC L ° of Cora network. (b) First 20 ADC EN of Cora network. 



Table 2: Summary of the first 20 ADCs of Cora citation network. 



initialization is used. Second, we applied the Infomap algorithm of Rosvall et al. (2009), 



which showed very good performance in LFR benchmark as reported by Lancichinetti and 



Fortunato (2009a). Next, we first provide a visual comparison of communities detected by 
these four algorithms in Figure |5J 

The visualization of the results of harvesting algorithms through adjacency matrix is not 
straightforward since nodes may appear more than once due to the possibility of multiple 
memberships. To see the community structure, the rows and columns are arranged by the 
source parts and the terminal parts of the twenty approximated directional components and 
the remaining nodes are appended at the end of rows and columns. Edges are shown as 
blue dots in the plot. Internal edges of ADC appear as blue blocks in the diagonal and all 
internal edges appear only once in the visualization. Meanwhile, blue dots outside of the 
blocks are the edges that are not harvested in the first twenty harvesting. As the harvesting 
goes on, all edges outside of the blocks will eventually append to the diagonal blocks and 
appear as a thin line at the end. We also use yellow dots to indicate the reappearing internal 
edges of ADCs that appear between blocks because of the multiple memberships of source 
nodes and terminal nodes. 

The lower panels in Figure [8] show results of the methods in comparison. The result 
from the DI-SIM algorithm is summarized by the adjacency matrix of the Cora citation 
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Figure 8: Top panels: The results of harvesting algorithms of the Cora citation network. 
The rows and columns are arranged by the source parts and the terminal parts of the first 
twenty ADCs and remaining nodes are appended at the end of rows and columns. Bottom 
panels: Adjacency matrix of the Cora citation network with rows and columns reordered 
by the result of the DI-SIM algorithm and Infomap. 



network with rows and columns reordered by the partitions (Figure 8c). The row of matrix 
is reordered by the partition of source nodes and the column of matrix is reordered by 
the partition of terminal nodes. The adjacency matrix rearranged by the communities of 



Infomap is shown in Figure |8dJ in which the order of rows and columns are the same. 
Comparing all four panels, we conclude that the obvious block structure in the plots of 
Lg-harvesting better represents the community structure in the Cora citation network. 

The communities detected by harvesting algorithms have more heterogeneous sizes com- 
pared to the communities of Infomap (Figured). For example, {SZi(ADC^ )}i=i^ mm ^o range 
from 50 to 20,000 while the Infomap provide 5312 small communities whose number of nodes 
is less than 50 and 114 communities whose number of nodes is between 51 and 270. This 
difference comes from the fact that the harvesting algorithm accounts the asymmetric na- 
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ture of the citation network while Infomap looks for symmetric communities on the highly 
asymmetric directed network. 

6.1.2 Comparison through Manual Categories 

The manually assigned categories of papers in the Cora citation network provided us with 
extra information to validate the quality of communities detected from different algorithms. 
Before making comparisons we present a summary of the eleven categories in Table |3j The 
sizes of category span a large range, from 10,784 papers in Artificial Intelligence to 582 
papers in Information Retrieval. Given the categories, we calculate the d-Ncut value of 
each category to form a baseline. We observe that the values are moderate and ranged from 
0.29 to 0.49, which are slightly higher than the d-Ncut values of ADC L °s. 

We investigate the consistency between detected communities from each algorithm and 
the manually assigned categories. The communities of Lg-harvesting algorithm are reported 
in detail in Table [1J while the results of other algorithms can be found in Appendix [Cj The 
communities are reported by their order of being harvested. 

The first five harvested communities are fairy large and reveal interactions among fields 
of CS. Papers in ADC^ are mainly from two fields, artificial intelligence (AI) and human 
computer interaction (HCI) and further investigation shows that majority of these papers 
in AI are in the vision and pattern recognition sub- field. ADC^ and ADC^° are dominated 
by papers from AI and these two communities take up nearly half of the papers of the AI 
category. ADC^° also embeds an interaction between a sub-field of AI (theorem proving) 
and two sub-fields of data structures algorithms and theory (formal languages sub-field and 
logic sub-field). ADC^° is the biggest community that is mainly led by four categories; 
operating systems, programming, databases and networking. 

The rest of those communities are smaller in sizes and each contains less kind of cat- 
egories. In other words, the small communities have high precision and low recall with 
respect to the manual categories. Many of small communities are related to AI category 
and they represent different sub-fields of AI. For example, ADC^ corresponds to speech 
sub-field and natural language processing sub-field. ADC^ mainly covers knowledge rep- 
resentation sub-field. There are also meaningful small communities from fields other than 
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Table 3: List of ten fields of Computer Science. 
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Table 4: Number of papers in the first twenty approximated directional components of 
Lo-harvesting for each category. 



AI, for instance, ADC^ stands for logic design and VLSI sub-field of hardware and ar- 
chitecture. Finally, we found that ADC^g shows a connection between natural language 
processing sub-field of AI and information retrieval. 

The analysis on the composition of manual categories shows that the communities de- 
tected by the harvesting algorithm meets our expectations regarding the manual categories. 
In addition, the discovery of the several large communities and many small communities 
are consistent with the core- whisker model briefed by Leskovec et al. (2008). We also sus- 
pect a possible hierarchical community structure within the large communities but we leave 
investigations along this direction in our future work. 



6.2 Scalability of Harvesting Algorithms in a Large Social Network 

The recent emergence of social networks and their wide accessibility greatly enhanced our 
ability to obtain and deliver information in a much shorter time span. However, the massive 
size, more than millions of nodes in a modern network, demand scalable algorithms for 
analysis. Many community detection algorithms that search for optimal partition of nodes 
does not scale well to these large social networks. On the other hand, the harvesting 
algorithm detects a community at a time according to the quality of the detected community. 
In this experiment, we test our harvesting algorithms on one such huge network over which 
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the partitioning approaches may fail, especially when the network is highly asymmetric and 
the sizes of communities are relatively small. 

We use a social network dataset^of Tencent Weibo, a micro-blogging website of China. 
Users in this network may subscribe to news feeds from other users based on their interests. 
The subscriptions are represented as directed edges between users. This network contains 
1,944,589 non-zero degree nodes and 50,655,143 directed edges, which leads to the average 
out-degree 25. The social network is highly asymmetric and it has only 0.2% of symmetric 



links. We test the harvesting algorithms implemented in MATLAB on a linux machine (2x 
Six Core Xeon X5650 / 2.66GHz / 48GB). The sparsity level parameter in the harvesting 
algorithms are designed to capture communities with the size in the range of 10 to 2,000,000. 
The grid of sparsity parameter 77 in Lo-harvesting is set as {exp(— k) : k = 18 + i(5/50), i = 
1, . . . , 50} and the grid for a = (3 in £" TV-harvesting is set as { i+ e xp(fc) ? ^ — 3 + i(8/50), i = 
1, . . . , 50}. The early stopping method is applied with the parameters s p = 1.1 and si = 0.8. 

The computation time to harvest 1000 ADC L ° was about 51 hours while that of har- 
vesting 1000 ADC EN was around 46 hours. On average, each harvesting took about 3 
minutes for both algorithms. In addition, approximately 4 ~ 5GB of memory were required 
for both algorithms. For comparison, a representation of partitioning algorithms, Infomap, 
is applied with R package igraph. Unfortunately, the algorithm has stopped with an error 
after 56 hours of running. 

To see the quality of harvested communities, we calculate the d-Ncut values along with 



the size of ADCs (Figure 9a). Different from the case of Cora citation network, the EN- 
harvesting reports better d-Ncut values than the Lo-harvesting algorithm does. Especially, 
the TV-harvesting is able to detect small communities with low d-Ncut values because the 
constraint Q is directly applied on the surrogate size of a community, while the optimization 
Q in the Lo-harvesting directly penalizes the size. We also verified that good communities 



are relatively small (~ 200) in such huge social networks, as reported in Leskovec et al. 



Q2008D . 

In addition, we find it interesting to compare the size of the source part and that of 
the terminal part. Specifically, we investigate how much of members are common in both 
parts. We define commonality of a ADC as the ratio of the number of common nodes 
against the total number of nodes in the union of the two parts. It appears that small 
communities are more likely to have higher commonality while big communities tends to 



have low commonality (Figure 9b). The high commonality in small communities implies that 
the members of small communities play the role of source and terminal at the same time, 
which is common in small private communities. In contrast, further inspection showed that 
most of members in big communities play the role of source while only few popular members 
play the role of terminal. Besides illustration of the scalability of the harvesting algorithms, 
this empirical observation highlights that our algorithms are capable of incorporating the 
direction of links and detecting small communities. 

7. Conclusions and Discussion 

In this paper, we found that integrating the direction of links plays an important part in 
the characterization of communities in directed networks. We first introduced a new notion 



4. http://www.kddcup2012.org/c/kddcup2012-trackl/data 
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Figure 9: Summarize the result of harvesting algorithms on a social network (a) Scatter 
plot of d-Ncut vs size of ADCs. (b) Scatter plot of commonality vs size of ADC EN s. 



of community, directional components, that is capable of discerning the two different roles 
of each node in a community. Such communities are directly connected with the spectrum 
of the graph Laplacian. 

We proposed two regularized SVD based harvesting algorithms that sequentially identifle 
communities that have majority of links within a community while few links are placed 
between the communities. The regularized SVD method is linearly scalable to the number 
of nodes and the number of edges in the network, so it is computationally efficient. The Lo- 
harvesting algorithm showed good performance even in networks having low average-degree. 
Meanwhile the TV-harvesting algorithm provided great results in detecting relatively small 
and dense communities. Both simulation studies and applications on real networks show 
the effectiveness of the proposed harvesting algorithms. 

We believe the harvesting algorithm enables genuine analysis on community structures 
in highly asymmetric directed networks of real applications. Especially, the simple compu- 
tations that only rely on matrix multiplication and thresholding of vectors lead to further 
improvement of the algorithm through parallel and distributed computing. 
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Appendix A. Proofs for Section [3] 

We want to remark that the definition of directional components does not require the weight 
of links to be zero or one. Following proofs assume non-negative valued weights. 

A.l Proof of Theorem [3] 
Proof 

The proof of this theorem relies on the properties of Laplacian matrix of A, where the 
matrix A is 

W 
W l 



A 



lfG#lxRl r l, where S is the set of source nodes and T is the set of terminal nodes. 
The matrix A can be considered as the adjacency matrix of the bipartite graph expression 



of a directed graph (Zhou et al. , 2005; Guimera et al. , 2007). A bipartite graph is denoted 
by B = (<S,T, £). C is the set of all the edges that can be expressed as an ordered pair 

(M),s eS,teT. 

We denote B(G) the bipartite conversion of a directed graph G. We show that a 
directional component of G is equivalent to a connected component of B(G). First, let 
us show that a directional component (DC) is a connected component (C) by examining 
the connectivity and maximality conditions: 

• Connectivity: Any pair of nodes (s —> t) are connected in B(G). 

• Maximality: Assume that there exists a node that is connected to C but not a member 
of DC. Then there should be a directed edge starting from the node or ended at the 
node in G. In either case the node is a member of DC. It contradicts to the maximality 
of DC. Thus there is no such node. 

Similarly, we show that a connected component C in B(G) is a directional component DC 
in G. Any pair of nodes (s,t) 5 s E S D B(G),t G Tfl B(G) is D-connected in G by the 
connectivity in B(G). Maximality for a directional component is again obtained by using 
the maximality of C. 

A bipartite graph is an undirected graph thus we can apply the proposition 4 of 



Von Luxburg (2007) that shows us the equivalence between the number of connected compo- 



nents of an undirected graph and the multiplicity of the zero eigenvalue of graph Laplacian 
matrix of the undirected graph. Let L sym be a normalized graph Laplacian of A, which is 
defined by 



J sym 



i-Q, 



where, 



Qa = d 



A 




Q 
o 



(13) 
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and Da is the diagonal matrix of the row sums of A and it is equal to 



D A = 



D r 
D c 



The proposition 4 of Von Luxburg (2007) says that the multiplicity m of the eigenvalue 

zero of L sym is equal to the number of connected component in the undirected graph corre- 

i 

sponding to A and the eigenspace of zero is spanned by the vectors {D^1-c k , k = 1, . . . , m}, 
where lc k is the indicator vector for kth connected component. 

By the definition of L symi if A is an eigenvalue of L sym then 1 — A is an eigenvalue of 
Qa- It follows that the eigenvalue zero of L sym corresponds to the eigenvalue one of Qa- 
In fact, one is the principal eigenvalue of Qa because the eigenvalue zero is the smallest 
eigenvalue of L symi which is a non-negative definite matrix. 



By the standard result about the eigenvalues of Qa and the singular values of Q (see Horn 



and Johnson[ |1994| chap. 3), the principle singular value of Q is the principle eigenvalue of 

Sk eR\ s \,Dh Tk e 



Qa which is one. A vector D^lc k can be broken into two vectors D£ 

ii- II 
Rl^l, where Drls k is the first |5| entries of D^lc k and Dclr k is the last \T\ entries of 



D\lc k . By (13), the two vectors satisfy 



Dn Sk 



= QD£ l Tk 

= Q t D 1 n Sk , 

1, . . . , m} is a set of orthogonal vectors since 
i 

Sys are exclusive. The same argument holds for {Dc lr fc5 k = 1, . . . , m}. Thus, the pairs of 

ii 

vectors {(Dr ls k , D<? lr fc ) 5 k = 1, . . . , m} span the singular space of the singular value one 
of Q. 



as one can find in 



Dhillon 



(2001). {D?l Sk ,k 



A. 2 Lemmas for Proposition [4] 

Even though the directional components are initially developed for a decomposition of a 
directed graph, the same decomposition can be done to a non-negative matrix B if B is 
treated as a weight matrix of a directed graph whose nodes are mapped from all rows and 
columns. We call a sub-matrix of B a directional-component block if the sub-matrix is 
corresponding to a directional component of the directed graph generated from the weight 
matrix B. Following Lemmas are required for the proof of Proposition |1J 

Lemma 10 Let B E M m x W 1 be a matrix of a single directional- component block. Then, 
B l B is irreducible. 

Proof Notice that B is a nonnegative matrix and its row sums and column sums are 
all positive by the condition. The single directional-component block can be considered 
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as a weight mat rix of a directed graph, sa y G ,with a single directional component. By 

it suffices to show that (/ + B t B) r 



Lemma 8.4.1 in 



Horn and Johnson 



(1990), 



> 



for the irreducibility. B l B can be treated as the weight matrix of an undirected graph, 
H — (T, W s ), where T is the set of terminal nodes of G and W s is the weight matrix 
obtained by W s (iJ) = (B t B)^ j y H is a connected component by the D-connectivity of 
nodes in G. Then, (/ + B f B) is the weight matrix of a connected component since it is the 
weight matrix of a undirected graph, say J, that is obtained by adding self edges to all the 
nodes of H. To obtain (J + B^) 71 - 1 > 0, we need to show any nodes in J can reach any 
nodes in J at n — 1 steps. Since H is a connected component, the diameter of H is at most 
n—1 and thus any nodes can be reached within n — 1 steps from any nodes in J. Following 
the self edges as many as needed to make exact n — 1 steps, the desired result is obtained. 



Lemma 11 For any submatrix of Q, say Q s , the largest singular value of Q s is less than 
or equal to one ( ai(Q s ) < 1), and the equality holds only when Q s includes directional- 
component blocks. 

Proof 

Without loss of generality we may assume Q has only one directional component since 
directional-component blocks can be considered separately for the spectral property. If Q s 
includes the directional component block of Q, then cri(Q s ) = 1 by Theorem [5J Thus we 
showed that cri(Q s ) = 1 if Q s includes a directional component block. 

Notice the fact that any sub-matrix can be obtained by deleting columns and rows 
of Q sequentially. Q s does not include the directional component block of Q only if the 
deleted columns or rows are not zero vectors. Even though Q s may have more than one 
directional-component block, we only consider the directional-component block with the 
largest singular value since (Ji{Qs) is determined by it. 

We begin with Q s that is obtained by deleting a non-zero column of Q. Theorem 3.1.2 



of Horn and Johnson (1994) tells that <Ji(Q s ) is a solution of the optimization problem, 

(Ti{Qs) = max\\ x \\ 2=1 \\Q s x\ 



2- 



Notice that (Ji{Q s ) 2 — Xi(Q t s Q s ), where Xi(B) is the largest eigenvalue of a symmetric 
matrix B. 

QlQs is irreducible by Lemma 10 Applying theorem 8.4.4 of Horn and Johnson (1990) 



to QlQs tells us that there exist a vector x > 0, \\x\\2 = 1 such that Xi(QlQ s ) = HQs^Hi- ^ 
deduces a\{Q s ) = HQs^lh- 

Now, we show that (T\(Q S ) < (T\(Q). Consider W S1 a sub-matrix of W, which is obtained 
by taking the same index of Q s in Q. W s can be normalized to Q S1 in the same way that 
the full weight matrix W is normalized to Q. We know that ai(Q s ) = 1 and 

Qs — Qs + E 

where E is a matrix that has non-negative entries and at least one positive entry. Basically, 
this is because deleting a column of Q decreases at least one entry of the row sum vector. 
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The existence of a positive entry in E implies that 

0i(Qs) = max\\ x \\ 2=1 \\Q s x\\2 

< max\\ x \\ 2=1 \\(Q a + E)x\\ 2 

The inequality comes from x > and ||Q s x||2 < \\(Q S + E)x\\2- 

The case of deleting a row of Q can be done in the same way by taking the transpose 
of Q since singular values are invariant under taking a transpose. 

Now we know that a deletion of a column or a row of Q result in a reduction of the 
largest singular value. Deleting more than one columns or rows can be done sequentially, 
therefore, cri(Q s ) < &i(Q) if Qs does not include at least a directional-component block of 
Q. ■ 



A. 3 Proof of Proposition [4] 
Proof 

Given a solution u,v and C(S,T), notice that ||u||o = \S\ and ||v||o = \T\. We obtain 
a matrix Q{C(S,T)) by setting the rows and columns of Q that are not in 5, T to zero 
vectors. Then, ^ can be written as 

mfflc<7i(Q(C(S,T))) -nSZu{C(S,T)) 

Lemma [TT| tells us that cri(Q(DCi)) is equal to 1 and that is one of the largest among 
{ai(Q(C))|5Z w (C) > SZ u (DCi)}. Thus all C such that SZ U (C) > SZ^Dd) can be 
excluded from the consideration. Thus, to make DC\ the solution, we need to find 77 > 
satisfying 

1 - riSZuiDd) > ai(Q(C)) - riSZ u (C), 

for all C such that SZ^C) < SZ UJ (DCi). After an arrangement of above inequality, 77 
should satisfy 

„< i^om . (14) 

v< szUDd)-szuc) 1 ' 

SZ u (Dd) - SZ U (C) > by the condition of C and 1 - ai(Q(C)) > by Lemma[TTJ thus 
setting the right hand side of (14) as e finishes the proof. ■ 



A. 4 Proof of Theorem [8] 

Proof Express the objective function and the constraints by using a Lagrangian multiplier, 

min - u*z + A((l - qj)||u||| + a||u||i). (15) 

u,A 

Then, differentiate the objective function in ( fl5| ) by u and set it to zero, 

-z + A(2(l - a)u + aT) = 0, (16) 
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where I\ = sign(i^) if u\ ^ 0, otherwise, Yi E [—1,1]. The Karush-Kuhn- Tucker (KKT) 
conditions require A((l — a)||u||2 + ^|| u ||l — c i) = 0. If A > 0, the solution is 



u 



S(z, Xa) 
2A(1 - a) 



(17) 



A can be zero, if the solution is not on the boundary of the contraint. But it does not happen 
unless z is a zero vector. Thus, A > is chosen so that u satisfies the KKT condition. 



{I -a) 



5(z, Xa) 



2A(1 -a 

k-i 



+ a 



S(z, Xa) 



2A(1 - a) 



k-i 



V y V y 1=1 V y 1=1 



(18) 



where k satisfies \z\ 
becomes 



(k) 



< Xa < \z\ 



(k-iy 



Denote the threshold level, d = Aa, then (18) 



k-i 



k-i 



^2 Ed% " d ) 2 + h Ed% - d ))= 



(19) 



\ i=i i=i J 

where k satisfies < d < \z\( k _iy Using Lemma |9| one can determine the threshold 



level d of ([19]) by setting z and c = ci^Jr- In the actual algorithm, the value of A is not 
required for getting the solution. For the record, the corresponding A is presented, 



a 



(20) 



A. 5 Proof of Lemma [9] 

Proof As the first step, let us show that G z (-) is monotone decreasing function. We need 
to show that if d\ > d^ then G z (di) < G z (^2)- Denote k{d) for the k corresponding to the 
threshold level d. The first term of (fTTl) is monotone decreasing of d because 



k(d 2 )-l fc(di)-l 
^2 E d%)-^) 2 >^2 E ^~d2? (21) 

2 9 — 1 1 



1=1 1 2=1 

fe(di)-l 



>4 E (NW-^) 2 - (22) 



2=1 



The first inequality comes from ^(^2) < fc(di) and d\ > cfo. The second inequality comes 
from d\ > g?2- The second term of ( pTj ) can be done in the similar way and the desired 
result is obtained. 
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As the second step, we find the approximated solution of d from the set of {|^|(^)}i=i...n* 
By plugging in \z\^ to d in the increasing order of i, we can find k such that G z (\z\^) < c, 
GUI 

z \(k+iy) ^ c ^ ^ ne mon °tonicity of G z (-) and being c in the range of G z (-). This 

computation can be done efficiently by computing two cumulative sums, \ z \\i)}k=i...n 

and \z\(i)}k=i...m m the increasing order of i until k is obtained. A simple algorithm 
for finding k that satisfies condition (12) in this Lemma is provided in Algorithm [5j 



Algorithm 5 How to find k 



Require: (z\ > z 2 >,•••, > z n ), c > 
initialize Si <- 0, S2 <- 0,k <- 2 
for k = 2 : n do 

S\ <— Si + Zk-i 
S 2 <- S 2 + z\_ x 

G k = ^{S 2 - 2z k S 1 + (k — l)z 2 k ) + ^(^ - (k - l)z k ) 

if Gk > c then 
k<-k-l 
return k 

end if 
end for 
if Gk < c then 

fc ^— n 

return 
end if 

In the second step, we already know that < d < wn i° n m eans k = k fixed 

now. Therefore solving a quadratic equation of a, 

^E(N«- rf ) 2 + ^E(Nw- d ) = c ( 23 ) 

2=1 2=1 

determines the solution d. By the quadratic formula, the solution is 

l(z) 



EtrN 2 ^ 



4c + fc 



(24) 



knowing that d > 0. 



Appendix B. Derivation of d-Ncut 

We derive the directed-Normalized-cut (d-Ncut) criterion by mimicking the relationship be- 
tween Normalized-cut and the graph Laplacian matrix of a undirected graph ( | Von Luxburg , 
20071) . 
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Let us introduce a notation, f*.Mg := \ J2ij Wij(fi-gj) 2 , where W is an weight matrix. 
Given a community C(S,T), S is the source part of a directional component, and T is its 
terminal part. We also denote S, T the complement sets of S, T respectively. 

d-Ncut criterion is obtained by assigning the normalized membership vectors to f , g in 
f* Aig. Let f,geR n be piecewise constant vectors, 

i £ S 



0, ieS 



9j 



1 7 C T 

0, j € f 



(25) 



where Vol(S') is the sum of out-degrees of the nodes in S and Vol(T) is the sum of in-degrees 
of the nodes in T. Then f*./Wg can be rearranged into 



h3 



_ Cut(5,r) Cut(g,T) 1_\ 

-^c^ + ^oW + Cut(5 ' T) ^^py TwyJ • ( 6) 

Here f^A'/g measures the ratio of weights in external links and the balance of weights in 
5, T, which can be considered as a measure of the community structure of C(S,T). 

The d-Ncut criterion simultaneously considers a community and its complement, C(S, T) 
and C(S,T), and it is defined by 

d-Ncut (5, T) = f*Mg + ^Ms, 

where f , g are the membership vectors of 5, T. In other words, small d-Ncut value of a 
community implies that the community not only has a strong community structure in itself 
but also separates other good community. 

Furthermore, we want to discuss that regularized SVDs, Q and ([9]), have connection 
to the minimization of d-Ncut criterion. The argument is similar to connection between 
Normalized-cut and spectral clustering in undirected graphs ( [Von Luxburg , 2007). Since 



the minimization of normalized cut is well-known NP-hard problem, a relaxation in the 
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optimization problem is needed. The first term of d-Ncut criterion can be expressed as, 

h3 

lj2 W v(fi + 9j-2f igj ) 



2 

h3 



\ ( E <^ + E - 2 E w afi9i ) 



2 

^{i t D r i + g t D c g-2f t Wg) 



Notice that (25) implies f^I} r f = g t D c g = 1. Thus, minimizing ^.Mg is equivalent to 

min -PWg, subject to f*D r f = g f D c g = 1, (27) 



if the condition of piecewise constant vectors on f , g is dropped from (25). Further simpli- 
fication of (27) brings 

maxu^Qv, subject to u^u = v^v = 1. (28) 

u,v 

1 1 
where u = D? f , v = Dl g. 

Even though the relaxation converts the NP-hard problem to linear algebra problem that 

is easier to solve, it is unknown that how good the approximated solution is. Typically, the 

approximated solution provides global division of a graph. Our initial motivation was to 

give sparsity-inducing penalization on u, v, which resembles the sparse membership vectors, 

in order to recover small communities directly. 



Appendix C. Tables of Section 6.1.2 



Here we provide the results of community detection algorithms (i?7V-harvesting, DI-SIM, 
Infomap) on Cora citation network. The result of Lg-harvesting algorithm is displayed in 
Table |4j These tables show the number of papers in manually assigned categories for each 
community. 
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AI 


DSAT 


DB 


EC 


HA 


HCI 


IR 


Net 


OS 


Prog 


Uncategorized 


1 


6042 


320 


147 


115 


144 


129 


235 


45 


136 


255 


548 


2 


























6 








3 


























8 








4 


446 


385 


20 


47 


152 


31 


4 


3 


106 


655 


225 


5 


1800 


1256 


805 


305 


389 


1139 


190 


637 


1734 


2039 


1205 


6 


6 


5 


6 


130 


5 


60 





509 


133 


13 


95 


7 


41 


315 


3 


256 


8 





3 


1 


24 


2 


50 


8 


18 


25 


11 


250 


7 


27 


8 


51 


182 


28 


47 


9 


92 


2 


18 


1 


1 














73 


20 


10 























2 


12 





2 


11 


87 


67 


8 


29 


8 


1 


2 


3 


3 


3 


19 


12 


7 


3 





1 











25 


1 





4 


13 


14 


18 











1 





25 


6 


1 


2 


14 


2 
































15 


78 

















2 











1 


16 








1 





2 




















17 








33 





2 





1 











2 


18 


57 


1 


2 




















3 


10 


19 


55 


3 


6 




















2 


23 


20 























13 








5 



Table 5: Number of papers in the first twenty approximated directional components of 
E 1 TV-harvesting for each category. 





AI 


DSAT 


DB 


EC 


HA 


HCI 


IR 


Net 


OS 


Prog 


Uncategorized 


1 


687 


1084 


723 


575 


571 


416 


71 


750 


1176 


1933 


890 


2 


28 


4 




















1 





1 


3 


4 





























5 


4 


2650 


14 


18 


21 


6 


47 


95 


1 


5 


14 


144 


5 


120 


165 


78 


85 


177 


173 


13 


489 


1023 


1075 


332 


6 


100 


42 


5 


14 


13 


17 


5 


10 


18 


23 


20 


7 


2509 


1374 


269 


316 


373 


506 


126 


288 


305 


779 


519 


8 


13 


8 





18 





3 

















9 


4658 


406 


167 


149 


64 


485 


272 


20 


50 


148 


430 


10 


15 


7 


1 


3 


3 


4 





3 


2 





4 



Table 6: Number of papers in the source partition DI-SIM for each category. 
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A T 

Al 


JJdAI 






1 I A 

HA 




IK 


JNet 


Ud 


Prog 


Uncategorized 


1 


118 


1 





1 





1 














10 


2 


49 


87 


1 


24 


1 











1 


1 


6 


3 











6 


2 


2 





1 


36 


13 


2 


4 


7 





5 


1 


27 


5 


1 


2 


56 


16 


18 


5 


1 





7 


8 


30 


2 





1 


137 


11 


15 


6 











1 





3 





8 


1 








7 


222 




















3 








9 


8 


1 


1 


2 








1 





2 


38 


35 


15 


9 


























1 


66 


2 


10 


133 











1 





2 








1 


9 


11 


6 


80 

















1 


16 


4 


6 


12 











17 





43 





183 


8 





19 


13 


1 


16 








7 











22 


131 


8 


14 


166 





























11 


15 


1 


17 








2 











31 


38 


9 


16 


14 
































17 


4 


1 


2 


1 





5 








1 


116 


3 


18 

















6 





74 


17 


3 


8 


19 


1 


1 








1 











1 


103 


15 


20 


98 


3 


























6 



Table 7: Number of papers in the randomly selected 20 communities detected by Infomap 
for each category. 
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